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ABSTRACT We analyzed the transcriptome of Escherichia coli K-12 by strand-specific RNA sequencing at single-nucleotide reso- 
lution during steady-state (logarithmic-phase) growth and upon entry into stationary phase in glucose minimal medium. To 
generate high-resolution transcriptome maps, we developed an organizational schema which showed that in practice only three 
features are required to define operon architecture: the promoter, terminator, and deep RNA sequence read coverage. We pre- 
cisely annotated 2,122 promoters and 1,774 terminators, defining 1,510 operons with an average of 1.98 genes per operon. Our 
analyses revealed an unprecedented view of E. coli operon architecture. A large proportion (36%) of operons are complex with 
internal promoters or terminators that generate multiple transcription units. For 43% of operons, we observed differential ex- 
pression of polycistronic genes, despite being in the same operons, indicating that E. coli operon architecture allows fine-tuning 
of gene expression. We found that 276 of 370 convergent operons terminate inefficiently, generating complementary 3' tran- 
script ends which overlap on average by 286 nucleotides, and 136 of 388 divergent operons have promoters arranged such that 
their 5' ends overlap on average by 168 nucleotides. We found 89 antisense transcripts of 397-nucleotide average length, 7 unan- 
notated transcripts within intergenic regions, and 18 sense transcripts that completely overlap operons on the opposite strand. 
Of 519 overlapping transcripts, 75% correspond to sequences that are highly conserved in E. coli (>50 genomes). Our data ex- 
tend recent studies showing unexpected transcriptome complexity in several bacteria and suggest that antisense RNA regulation 
is widespread. 

IMPORTANCE We precisely mapped the 5' and 3' ends of RNA transcripts across the E. coli K-12 genome by using a single- 
nucleotide analytical approach. Our resulting high-resolution transcriptome maps show that ca. one-third of E. coli operons are 
complex, with internal promoters and terminators generating multiple transcription units and allowing differential gene expres- 
sion within these operons. We discovered extensive antisense transcription that results from more than 500 operons, which fully 
overlap or extensively overlap adjacent divergent or convergent operons. The genomic regions corresponding to these antisense 
transcripts are highly conserved in E. coli (including Shigella species), although it remains to be proven whether or not they are 
functional. Our observations of features unearthed by single-nucleotide transcriptome mapping suggest that deeper layers of 
transcriptional regulation in bacteria are likely to be revealed in the future. 
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Escherichia coli burst into the realm of model organisms with the 
discovery of conjugation by Joshua Lederberg in 1946 (1). Just 
15 years later, Francois Jacob and Jacque Monod proposed the 
operon model in E. coli (2). For two-thirds of a century, E. coli has 
been an important vehicle for scientific investigation, playing a 
role in research resulting in no fewer than 10 Nobel prizes ( 1-10). 
The E. coli K-12 genome was among the early ones sequenced (11) 
and E. coli is unique among model organisms, possessing bio- 
chemical or genetic evidence for functions for ca. 75% of its 
known genes, making it arguably the best understood organism 
(12). Examination of its genome sequence confirmed what had 



long been surmised, that genes of related function are frequently 
arranged in operons (13-15). 

Soon after the discovery of the lac operon, it became clear that 
not all operons are transcribed as discrete units of information 
neatly arranged end to end on the genome. First, it was recognized 
that regions of phage lambda are transcribed on complementary 
strands (16). Over the next 40 years, operons were studied, one or 
two at a time, in line with the technology of the day, revealing 
occasional glimpses of transcriptional complexity arising from 
overlapping, divergent (17, 18) and convergent operons (19, 20). 
Second, the perception of transcriptome complexity was forever 
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changed when it was found that at least one antisense (AS) tran- 
scription start site is associated with nearly one-half (46%) of Hel- 
icobacter pylori genes (21). There is also a substantial amount of AS 
transcription in E. coli (22-24). While some researchers suggested 
that extensive AS transcription is a "by-product" of the transcrip- 
tion machinery, largely because AS transcripts did not appear to 
be conserved in enteric bacteria (25), others concluded the oppo- 
site, that AS RNA has an important role in transcriptional regula- 
tion (26-32). The recent identification and sequencing of 316 po- 
tentially functional double-stranded RNAs in E. coli is a step 
toward laying the argument to rest (33). The "excludon" concept 
of AS RNA control of divergent operons ascribes an important 
function to overlapping, complementary transcripts (34). A re- 
cent study of Staphylococcus aureus suggests that AS transcripts 
drive RNase Ill-mediated RNA processing, although a compari- 
son of the AS RNA content of selected bacteria led the authors to 
infer that the mechanism is prevalent in Gram positives but absent 
in Gram negatives (30). Amid the mounting evidence for tran- 
scriptional complexity in bacteria and the finding that AS tran- 
scripts are prevalent in bacteria, we undertook a comprehensive 
transcriptome analysis of E. coli. 

RNA sequencing (RNA-Seq) offers tremendous power for 
high-resolution transcriptome analysis. However, the fullness of 
its power has yet to be realized for E. coli, because all previous 
studies of the E. coli transcriptome failed to annotate both the 5' 
and 3' transcript ends and hence operons were not precisely 
mapped. We therefore developed an organizational schema de- 
scribed herein to precisely map RNA-Seq reads across entire oper- 
ons, including both the 5' and 3' transcript ends, and to annotate 
these data in the context of the operon arrangement on the tran- 
scriptome. Though others used tiling microarray technology to 
address bacterial transcriptome organization (28,35), tiling arrays 
did not have the resolving power to define transcript ends pre- 
cisely or to elucidate operons with multiple promoters. More re- 
cent transcriptome mapping studies of E. coli relied on 5' end 
mapping to identify transcription start sites (TSSs) (36, 37). How- 
ever, our own critical examination of these data sets revealed ex- 
tensive discrepancies that call into question many candidate TSSs, 
reinforcing the need for alternative promoter-mapping strategies 
(38). Recent RNA-Seq analyses of E. coli were also unfortunately 
not designed to map transcript ends accurately because they relied 
on randomly primed cDNA synthesis (39) or they had a resolution 
of only ca. 50 nucleotides due to low sequence read coverage (40). 
The recent development of differential RNA-Seq technology al- 
lowed mapping TSSs in Helicobacter pylori (21) and Salmonella 
enterica (29, 41); however, operon architecture was not deter- 
mined because the 3 ' ends were not mapped. In evaluating these 
approaches, we recognized that identification of both 5' and 3' 
transcript ends is essential for precise mapping of transcriptional 
regulatory features. 

Considering the foundational role of E. coli in the life sciences, 
high-resolution RNA-Seq will stimulate progress by unambigu- 
ous mapping of the features that control transcription. To anno- 
tate operons and characterize their response to carbon starvation, 
we obtained a time series of RNA samples from wild-type E. coli 
K-12 BW38028 cultures grown to stationary phase on chemically 
defined, carbon source-limited (glucose) minimal medium. We 
chose these conditions because they are intrinsic to the physiology 
that allows E. coli to colonize the mammalian intestine yet survive 
in the environment until encountering a new host and, in the case 



of E. coli pathogens, cause disease (42). We analyzed these RNA 
samples by deep sequencing with a strand- specific RNA ligation 
approach (43) that ensures full read coverage and precise mapping 
of both the 5' and 3' transcript ends. In practice, only three tran- 
scriptional features are needed to define operon architecture, re- 
gardless of complexity. These are the 5' ends (promoters), the 3' 
ends (terminators), and sufficient RNA-Seq read coverage to con- 
nect the ends, which together define operons (Fig. 1 ). Our analyses 
revealed an unprecedented high-resolution view of E. coli operon 
architecture. Our analytical approach allowed us to test the hy- 
pothesis that bacterial operon structure accommodates substan- 
tial transcriptional complexity. We offer our annotated E. coli 
K-12 operon map as a community resource upon which others 
can participate in annotating additional transcriptional regulatory 
features (GEO accession no. GSE52059). 

RESULTS AND DISCUSSION 

Single-nucleotide resolved RNA-Seq data sets. E. coli K-12 has 
served as an important model organism for more than a half cen- 
tury and was the first bacterium analyzed by DNA microarray 
technology (44, 45). While several other bacteria have now been 
analyzed by RNA-Seq (21, 26, 28, 31, 41, 47-49), the limited RNA- 
Seq studies of E. coli have not provided single-nucleotide resolu- 
tion (39, 40). Herein, we used a strand-specific RNA ligation- 
based RNA-Seq strategy, which when coupled with a robust 
analytical approach, allowed us to define transcriptional features 
across the whole E. coli genome at single-nucleotide resolution. 
We acquired time series of RNA samples from duplicate cultures 
of E. coli K-12 BW38028 and its isogenic rpoS mutant BW39452 
during logarithmic- and stationary-phase growth on glucose- 
limited minimal medium (see Fig. SI in the supplemental mate- 
rial). In total, we sequenced 26 RNA samples to generate a data set 
of 72 . 1 million uniquely mapped sequence reads corresponding to 
5.5 gigabases of RNA-Seq data (see Table SI). Appropriate tem- 
poral expression of bolA, a known glucose starvation -inducible 
gene (50), confirmed that our RNA-Seq data correctly represented 
the growth conditions (Fig. 1). Our ongoing analyses of the RpoS 
regulon will be reported elsewhere. The correlation between rep- 
licate cultures was >0.96 (see Fig. SI); this level of biological rep- 
lication provides a reliable view of the E. coli K-12 transcriptome 
(Fig. 2). The data are available at GEO (accession no. GSE52059). 

We developed an in-house computational tool to convert the 
binary read alignment (BAM) files to base count (WIG) files to 
facilitate single-nucleotide resolution analyses. We normalized 
our base count data by using a strategy analogous to the total 
count approach (51) for normalizing gene-specific read align- 
ments. Accordingly, the resulting WIG files contain only the base 
location and the number of times each base is read (sequenced) 
and are > 100 -fold smaller than the sample read alignment (SAM) 
files. Advantages of this simple base count approach are several- 
fold: first, the data are inherently more computable; second, nor- 
malization of base count data makes all samples directly compa- 
rable and eliminates transcription unit (TU) length bias; third, the 
base counts of individual features can be computed and queried at 
any desired resolution from single nucleotide to an entire operon. 

Since we analyzed RNA-Seq reads at the base count level, the 
normalized base counts can be readily averaged across any range 
of bases to calculate the relative usage of transcriptional features, 
including promoters, terminators, TUs, and operons (Fig. 1). We 
empirically determined the number of bases used to calculate pro- 



2 mBio mbio.asm.org 



July/August 2014 Volume 5 Issue 4 e01442-14 



High-Resolution Bacterial Operon Architecture 



P-453576 T-454Q91 

Log Phase 




Stationary Phase 



B 



CTTC ACCG AAAAACCACC ACC AAACCTAAATATTTCTTCTTAC CTC CAATCC AAACCCTAAAAC CCC CTACTATTTAAACCC ATCC ATG ACATCTCAC CCTTCTCCC ACC AC ATATTTCATCATC ATACCTC 



bolA 




P-453576 

Stationary Phase 



S-453657 




16 
14 
12 



-t— ' 

§ 10 

O 
o 

CD 
(fi 
CC 
-Q 

CM 
CD 
O 




S-453657:T-454091 
S-453657 
' P-453576 



4 6 
Time point 



10 



yccJ 



P-1064783:T-1 066209 



Stationary Phase 



<4 



► 



wrbA 



P-1 0671 22:T-1 065880 
T- 1066062 

9.70 

* 7.68 




FIG 1 Single-nucleotide resolution of promoters and terminators in example complex operons. (A) The bolA operon contains transcription units (TUs) 
P-453657:T-454091 (red arrow) and S-453688:T-454091 (orange arrow). RNA-Seq data are shown in a JBrowse visualization of positive- strand (red) transcrip- 
tion in logarithmic- and stationary- phase samples (average from three replicates). The base count data were normalized and log 2 transformed such that track 
heights in JBrowse are directly comparable. (B) bolA promoter region showing primary promoter P-453576 and secondary promoter S-453658 at single- 
nucleotide resolution (drawn to scale). (C) Plot of promoter usage (average count of 10 bases beginning at TSS) and TU usage (average count of bases within TU) 
for 10 growth curve time points showing bolA induction upon entry into stationary phase (see Fig. SI for growth curve). (D) Terminator usage (average counts 
of 10 bases preceding and following terminator) is shown for T- 1066062, which is shared by converging operons agp on positive strand (red) and wrbA-yccJ on 
negative strand (blue). 



moter usage by comparing the single base count value at the TSS to 
3-, 5-, 10-, and 20-base averages, each beginning at the TSS. In 
practice, the shorter base count lengths were highly variable, pre- 
sumably because of staggered starts that are occasionally observed 
in primer extension experiments (52) and were frequently ob- 
served in the RNA-Seq data sets. However, a 20-base-count length 



was too long to allow discrimination of closely spaced promoters. 
We therefore used 10-base average counts for quantifying pro- 
moter usage (Fig. 1). The same 10-base average worked well for 
calculating terminator efficiency by comparing the 10-base aver- 
age counts before and after the termination site (Fig. 1 and 3). We 
used these base count values to calculate the usage of individual 
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FIG 2 Genome-wide promoter locations and annotated transcriptome map of a selected region. (A) Promoters aligned by genome location. Line heights 
correspond to normalized, TEX-enriched promoter usage values (see text for details), shown for logarithmic phase (black) and stationary phase (orange). (B) 
Annotated regulatory features of a selected region of the genome. Positive- strand RNA-Seq data (red) and negative- strand data (blue) were normalized for 
comparison between logarithmic- and stationary-phase samples. Primary promoters and corresponding TUs (red) are indicated by arrows extending from 
promoter to terminator, as are secondary promoters (orange), internal promoters (purple), and AS promoters (green). Beginning on the left, rmf is transcribed 
from a primary promoter and depending on growth conditions terminates either before or within the ycbZ-fabA operon, which has a primary promoter upstream 
of ycbZ, an internal promoter within ycbZ, and a secondary promoter upstream oifabA. matP is transcribed from primary and secondary promoters. ompA is 
transcribed from a secondary promoter in logarithmic phase and is cotranscribed from the primary promoter of the sulA-ompA operon during stationary phase. 
An AS TU that overlaps the sulA sense transcript is turned on in stationary phase. The sxy and yccF-yccS operons converge. Finally, mgsA is transcribed as an 
independent TU from a secondary promoter in logarithmic phase and also is expressed in the yccT-mgsA operon from a promoter that is active only in stationary 
phase. (C) Plot of TU base counts for ycbZ-fabA operon, colorized according to color scheme in panel B; (D) TU plot of sulA-ompA operon; (E) TU plot of yccFS 
operon; (F) TU plot of yccT-mgsA operon. 



transcription features as well as the impact of operon structure on 
relative TU and gene expression. 

Promoter mapping. Our search for promoters was driven by 
mapping of putative TSSs on the basis of (i) enrichment with 
terminator exonuclease (TEX), which degrades RNA molecules 
with 5 '-monophosphate ends and consequently enriches for 5'- 
triphosphate ends corresponding to the nucleotide initiated de 
novo by RNA polymerase (18); (ii) promoter motif analysis; (hi) 
consensus among replicate data sets; and (iv) sigma factor- specific 
RNA polymerase binding (SELEX). None of these approaches 
alone is comprehensive, because each gives rise to false-positive 
results and fails to find all TSSs (20). For example, TEX treatment 
does not enrich for some TSSs because RppH phosphatase activity 



removes the 5 '-triphosphates (53). Additionally, not all promot- 
ers have consensus motifs that can be identified by computer al- 
gorithms (54), nor do all promoters bind RNA polymerase in vitro 
(55). 

To facilitate promoter mapping, we wrote a simple algorithm 
to search for changes in base count values exceeding 2 -fold in 
replicate TEX-enriched and coverage data sets (n = 14, wild- type 
[WT] and rpoS culture samples from log and stationary phase). 
The TSSs of highly expressed genes were apparent in all 14 repli- 
cates. However, since the 14 samples represented both 
logarithmic- and stationary-phase samples, expression of some 
promoters was condition specific. In order to generate transcrip- 
tome maps that are condition independent for annotating the 
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response to many conditions in the future, we chose a consensus 
in which three replicates of either logarithmic- or stationary- 
phase samples have TSSs at the identical base locations as a start- 
ing point for promoter mapping. This conservative strategy re- 
vealed 1 1,329 putative TSSs, a value that is similar to the number 
of promoters found by Thomason and Storz (submitted for pub- 
lication), and includes known promoters of weakly expressed 
genes. However, this number far exceeds the expected promoter 
density on the E. coli genome, thus exemplifying the need to use a 
multifaceted approach to confirm promoters. To identify candi- 
date promoters missed by TSS mapping of regions that had few 
RNA-Seq reads, we employed genomic SELEX screening, which 
was developed for quick identification of genes under the control 
of specific transcription factors (57). Confirmation of tentative 
TSS's by RNAP binding was previously employed for promoter 
mapping of Salmonella enterica serovar Typhimurium (29). Sites 
that bound RpoD in vitro, exceeding a conservative signal-to- 
background ratio threshold of 3.0, and corresponded to RNA-Seq 
reads expressed in vivo identified 1,254 additional candidate pro- 
moters (see Fig. S2 in the supplemental material). 

Next, we used a bioinformatics approach to search the 50-bp 
sequences immediately upstream of the 12,583 putative TSSs for 
promoter motifs by using FIMO software (58) to screen against a 
library of E. coli promoter motifs available at DPInteract (59). We 
found it was necessary to modify the RpoD promoter library ac- 
cording to the characterization of 554 promoters by Mitchell et al. 
(60), which demonstrated that the RpoD consensus promoter has 
— 10 and —35 regions with spacing of 14 to 20 bases between 
promoter elements. The search output was restricted to promoter 
sequences correctly positioned within ±3 bases of the TSS, with 
E values corresponding to P values of <0.02. This multifaceted 
approach yielded 5,653 putative RpoD-dependent promoters, 
which we evaluated further by manual annotation, which involved 
direct visual observation. 

A visual graphic environment (JBrowse [61]) interfaced to an 
Oracle database facilitated manual annotation documentation. 
From the list of candidate promoters, we created a JBrowse track 
at the corresponding base locations, each displaying a "clickable" 
URL call to the database that automatically recorded the base lo- 
cation and allowed manual entry of metadata, including the type 
of promoter, regulatory information supported by differential ex- 
pression analysis, and comments. We annotated only promoters 
that could be experimentally associated with operons, by using 
RNA-Seq data as described in the next section. This comprehen- 
sive strategy yielded 2,122 vegetative promoters (Fig. 2), which 
more than doubled the 811 individually characterized E. coli pro- 
moters annotated in RegulonDB and calls into question several 
thousand candidate promoters that were identified by less reliable 
high-throughput strategies (35, 38). The promoter data set (see 
Table S2) is dominated by primary promoters (P), defined as the 
furthest upstream promoter in an operon (66.3%), with a lower 
number of secondary promoters (S) that are intergenic and down- 
stream of P promoters (19.6%), internal promoters (I) that are 
intragenic (9.8%), and AS (4.2%) promoters. All possible arrange- 
ments and orientations of these promoter types were observed 
and collectively generated substantial complexity in the transcrip- 
tome (Fig. 2). 

It is well known that promoter strength, i.e., quality, varies 
greatly (60) and that variability is reflected in our data set. To 
quantify promoter quality, we scored the four criteria (metrics) 



used to map candidate promoters (see Table S2). The promoter 
quality score was calculated by applying a weighted matrix on the 
basis of 10 points, where TEX enrichment carries a weight of 4, the 
promoter motif score carries a weight of 3, the TSS consensus 
(between replicates) score carries a weight of 2, and the SELEX 
score carries a weight of 1. The resulting analyses yielded promot- 
ers scored on a scale of 0 to 10. The TEX enrichment metric reflects 
the number of instances among four TEX replicates in which the 
ratio of TEX- treated versus non-TEX- treated base counts (10- 
base-count average beginning at the TSS) for a sample exceeded 
2 -fold. The promoter motif score was calculated in quartiles of 
E values for RpoD-dependent promoter motifs as determined by 
using FIMO. The TSS consensus score was calculated as the num- 
ber of occurrences of a TSS at a precise base location divided by the 
total number of samples evaluated (n = 14). The final metric was 
the presence or absence of SELEX- determined RpoD binding, 
which was scored as a 1 or 0. The 2,122 promoters ranged in score 
from 10 to 0.14, with the top 10% of promoters scoring above 7.8, 
the bottom 10% scoring below 2.9, and the average promoter 
scoring 5.5. 

We found no strong correlation between promoter usage (av- 
erage count of first 1 0 transcribed bases) and promoter confidence 
scores or promoter motif scores (see Fig. S3 in the supplemental 
material), which is in agreement with an earlier report (60). How- 
ever, we did find a weak correlation between promoter usage and 
TU usage (average count of bases from promoter to terminator) 
(see Fig. S3). We confirmed that TU usage and RNA half-life (62) 
(measured under similar conditions) do not correspond, as noted 
previously. Nevertheless, promoter and TU usage values do reflect 
the physiologically relevant transcript level, as the RNA concen- 
tration in the cell is determined both by the frequency of tran- 
scription initiation and the rate of RNA decay, which vary sub- 
stantially for different transcripts (62). 

Operon mapping. To annotate operons, it was also necessary 
to map the 3' transcript ends, which allowed documenting the 
connections between promoters and the corresponding down- 
stream terminators (Fig. 1). Our criteria for operon annotation 
were (i) the P promoter must be followed by sequence read cov- 
erage across the entire operon, (ii) the mapped 3' ends must ex- 
tend beyond the stop codon of the last gene in the operon, (hi) the 
S and I promoters must increase read coverage of the downstream 
bases, and (iv) internal terminators must decrease coverage of 
downstream bases without interrupting contiguous coverage by 
readthrough transcripts. Our search for 3 ' ends that can be asso- 
ciated with annotated promoter(s) by deep sequence read cover- 
age throughout the operon led to mapping 1,774 candidate termi- 
nators (see Table S3 in the supplemental material), 264 of which 
lie within operons and apparently permit partial readthrough 
transcription of downstream genes, as demonstrated for the 
sdhCDAB-sucABCD operon (Fig. 3). We evaluated the 1,774 3' 
ends by using TransTermHP (63) and confirmed that 623 have 
sequences characteristic of intrinsic terminators, which extends 
the number of annotated E. coli terminators previously annotated 
(227 [38]) by nearly 8-fold. It has been predicted that roughly 
one-half of terminators are intrinsic (64). The remaining 1,151 
terminators that were not confirmed by TransTermHP are candi- 
dates for ones requiring Rho or another protein factor for termi- 
nation. Since there is no computational approach to identify 
factor-dependent terminators, the data in Table S3 represent the 
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most extensive genome-wide prediction of nonintrinsic termina- 
tors. 

The preceding analyses of only logarithmic- and stationary- 
phase samples revealed a total of 6,463 regulatory features, includ- 
ing 2,122 promoters (see Table S2), 1,774 terminators (see Ta- 
ble S3), and 2,566 transcription units (TUs) corresponding to 
1,510 operons (see Table S4). The sequence reads covered more 
than 90% of bases within 90% of the annotated operons. The 
1,510 operons cover 2,985 of 4,457 genes (approximately two- 
thirds) annotated on the reference genome. As more data sets and 
growth conditions are analyzed, our simple organizational 
schema should make it straightforward to add newly identified 
regulatory features to the E. coli K-12 transcriptome map. For 
ready distribution, we converted our data sets to GenBank format 
by using "promoter," "terminator," and "operon" as feature keys 
(65). This data format allows annotation of any number of exper- 
imental parameters that affect the usage of these features. Our 
E. coli K-12 transcriptome annotation GenBank feature table is 
available from GEO (accession no. GSE52059). 

Operon organization examples. The data in Fig. 2 unequivo- 
cally confirm that the E. coli genome is organized in operons. In its 
original conception, the operon has a regulatory region with a 
single promoter that initiates transcription of a polycistronic 
mRNA covering the lac operon genes and ends with a single ter- 



minator. Indeed, many E. coli operons fit this model or are even 
simpler if they contain a single gene (monocistronic). However, 
the whole E. coli transcriptome reveals densely packed regulatory 
features that cannot be discerned from the genome sequence alone 
(Fig. 2). Complex operons result from transcripts initiated by S 
and I promoters, as well as internal terminators. For example, sulA 
and ompA are independently transcribed during logarithmic 
phase, with each gene having its own promoter and terminator. 
However, during stationary phase, the sulA TU reads through a 
nonintrinsic sulA terminator to form the sulA-ompA transcript, 
driven by an S promoter that increases expression of the ompA TU 
(Fig. 2). An AS transcript that fully overlaps the sulA coding se- 
quence is also switched on in stationary phase. This arrangement 
of the sulA-ompA operon and AS transcript was postulated as a 
means for posttranscriptional control of the synthesis of the cell 
division inhibitor SulA (66), which is further supported by our 
results. Our organizational schema makes the previously unanno- 
tated sulA AS transcript (12) and similar regulatory features read- 
ily apparent on the sulA-ompA transcriptome map (Fig. 2). Such 
differential expression of TUs within operons can provide bacteria 
with the ability to modulate gene expression to cope with physio- 
logical complexity (28, 30, 34, 41). 

It is especially notable that Fig. 2 reveals the E. coli transcrip- 
tome for only two growth conditions, log phase and stationary 
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FIG 4 Computational analysis of single-nucleotide resolution data reveals complex operon architecture. (A) Operons organized by increasing complexity; (B) 
TU usage plot of HgT-sfsA-dksA-yadB-pcnB-floK operon. The primary TU corresponding to the entire operon is shown in red. The differentially expressed 
d/csA- specific TU driven by promoter 1-161376 is shown in purple. The pcnB-folKTU driven by S- 159 171 is shown in orange. Note that transcript levels of dksA 
increase upon entry into stationary phase, whereas pcnB-folK decreases. (C) JBrowse instance showing HgT-sfsA-dksA-yadB-pcnB-floKoperon; (D) TU usage plot 
of ybdK-ybdJ-ybdF-nrsB-mbcM operon. Note the primary TU corresponding to the entire operon (red) decreases only slightly during transition from logarithmic 
phase into stationary phase, because it is comprised of two differentially expressed TUs, one of which increases and the other decreases during growth: the 
nfsB-mbcM-specific transcript (orange) essentially disappears in stationary phase, whereas the ybdK- specific transcript (blue) is induced in stationary phase. (E) 
JBrowse instance of ybdK-ybdJ-ybdF-nrsB-mbcM operon. 



phase, due to carbon source limitation. Our analyses show that 
29% of operons have more than one promoter and that 15% of 
operons have more than one terminator under these conditions 
(Fig. 4). Further, many operons are subject to multiple regulatory 
inputs (38) that have not been examined here. Differential mRNA 
decay can also provide an additional layer of control within oper- 
ons (62). No doubt, future RNA-Seq analysis of the myriad re- 
sponses to numerous regulatory signals is likely to reveal substan- 
tially more variation in operon architecture, as seen for Salmonella 
(41). 

Intricacy is readily apparent for operons with internal promot- 
ers and terminators. For example, three promoters upstream of 
the ahpCF operon contribute to its expression in an additive fash- 
ion (Fig. 5). Such an arrangement allows differential control of 
alkylhydroperoxidase production in response to stationary phase, 
osmotic stress, and oxidative stress (67). Likewise, three promot- 
ers contribute to ybfE-fldA-uof-fur operon expression during log- 
arithmic phase, allowing for continuation of uof-fur TU expres- 
sion, decline oifldA expression, and turnoff of ybfE expression in 
the stationary phase (Fig. 5). Although cotranscription of the 
complex ybfE-fldA-uof-fur operon was not previously recognized 
(68), it makes sense for uof-fur to be transcribed independently of 
ybfE-fldA under certain conditions, because/wr encodes a negative 
regulator of genes for iron uptake. Furthermore, uof expression is 
controlled indirectly by the trans- acting noncoding RNA RhyB, 
which is itself Fur regulated, thus forming a negative feedback 
loop responsive to iron limitation (68). The ability to unravel 
condition-specific terminator usage by our organizational schema 
is illustrated for the internal terminator of the sdhCDAB- 
sucABCD operon, which encodes three enzymes of the tricarbox- 
ylic acid (TCA) cycle (Fig. 3). This arrangement explains how 
intrinsic termination can allow one operon to function indepen- 
dently as two operons under appropriate conditions (69). These 
examples demonstrate how our promoter and terminator usage 



calculations can reveal new biological insights from the RNA-Seq 
transcriptome analyses. 

Catalogue of operon architecture. High- resolution mapping 
of well- characterized regions of the genome provided glimpses of 
intricate operon arrangements (Fig. 2 to 5). Our analyses of E. coll 
operons at single-nucleotide resolution further revealed numer- 
ous instances of complexity genome-wide. Single-gene operons 
with a single promoter and terminator make up 47% of all oper- 
ons, while 17% are "traditional" operons with more than one gene 
and a single promoter and terminator (Fig. 4). The remaining 
operons (36%) are more complex: 21% have multiple (as many as 
8) promoters, 7% have multiple (as many as 4) terminators, and 
8% have both multiple promoters and multiple terminators. The 
average operon has 1.98 genes (see Table S4 in the supplemental 
material). In our data set, the most complex operon, which en- 
codes several core cellular functions, has 14 genes, 8 promoters, 4 
terminators, and 23 TUs (yjeF-yjeE-amiB-mutL-miaA-hfq-hflX- 
hflK-hflC-yjeT-purA-nsrR-rnr-rlmB operon; see Fig. S4). 

Differential TU expression within operons can result from the 
activity of S and I promoters, internal terminators, and combina- 
tions of these regulatory features. For example, Fig. 4 illustrates 
how an I promoter and internal terminator can function together 
to increase expression of the DksA- specific TU in stationary 
phases. For the ybdK operon, Fig. 4 shows differential expression 
of the 5 ' and 3 ' TUs of the operon caused by transcription from an 
S promoter and an internal terminator. This arrangement of fea- 
tures results in a complete inversion in expression of the 2 TUs 
between logarithmic and stationary phases. These findings suggest 
that operon architecture permits E. coll to adjust relative levels of 
gene expression within the same operon in response to environ- 
mental conditions. 

To quantify differential gene expression within E. coll operons, 
we compared the base counts of TUs within the same operon 
under the same growth condition and tabulated the complexity 
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Time point Promoter 




-P-711173 
-S-710744 
S-710051 



FIG 5 Three promoters contribute to expression levels of genes within the ahpCF and the ybfE-fldA-uof-fur operons. (A) WT time series of TU base counts of 
three overlapping TUs within the ahpCF operon; (B) usage of 3 ahpC promoters (10-base average from TSS + 1 to + 10) during logarithmic phase (time point 
4); (C) TU coverage time series of the ybfE-fldA-uof-fur operon; (D) differential usage of three promoters within the ybfE-fldA-uof-fur operon during log phase. 
Promoter usage and TU coverage calculations are described in the legend to Fig. 1. 



that arises from internal promoters and terminators (see Table S6 
in the supplemental material). Of 548 complex operons displaying 
multiple TUs due to having multiple promoters or terminators 
(Fig. 4), 327 showed more than 2-fold differential expression of 
1 TU compared to other TUs within the same operon (see Ta- 
ble S6). Of 633 operons containing more than one gene, we ob- 
served 2-fold or greater differential gene expression within 315 
operons (e.g., see Fig. 4). In the case of polycistronic operons that 
have only a single promoter and terminator, it appears that differ- 
ential decay of the processed transcripts is responsible. In total, 
43% (642 of 1,510) of all E. coli operons show a complex gene 
expression regulatory pattern (see Table S6). Clearly, differential 
expression of TUs and genes within the same operon is common 
in E. coll. 

Our analyses provided the opportunity to map potential AS 
transcription across the transcriptome. In many cases, AS tran- 
scripts completely overlap and are complementary to sense strand 
transcripts that encode proteins; however, these AS transcripts do 
not appear to encode proteins. For example, the long AS RNA that 
is complementary to sulA does not appear to be translated, be- 
cause it has no properly positioned ribosome binding site nearby a 
start codon and thus is likely to be a long noncoding RNA (ln- 
cRNA). We found 18 transcripts for annotated protein-coding 
genes and small RNAs that completely overlap operons tran- 
scribed in the opposite direction (see Table S5). As a result of this 



arrangement, the 18 corresponding operons contain long non- 
coding AS transcripts that overlap the coding sequences on the 
opposite strand. 

Since genome annotation relies heavily on identification of 
coding sequences, we predicted that our transcriptome analysis 
would reveal unannotated genes. Indeed, we identified 96 tran- 
scripts that do not correspond to genes on the reference genome 
and were previously unannotated in E. coli K-12 (see Table S5). 
These include 89 AS transcripts that have an average length of 397 
bases, the longest of which is 1,168 bases. The remaining 7 tran- 
scripts are completely intergenic and do not overlap annotated 
genes. None of the 96 transcripts appear to code for protein be- 
cause they all have multiple stop codons in all three reading 
frames. Of the 89 AS transcripts, 21 are convergent with known 
operons that code for proteins, 7 are divergent with mapped oper- 
ons, and 40 completely overlap annotated operons. The remain- 
ing 21 AS transcripts overlap known genes that could not be an- 
notated into operons by RNA-Seq. The genomic regions 
corresponding to 72% of these IncRNAs are highly conserved in 
>50 E. coli and Shigella genomes. It was proposed previously that 
bacterial IncRNAs may be functional (30, 34), yet this was ques- 
tioned by others (25). Similar IncRNAs have also been found in 
eukaryotes, and although they are not well understood, they are 
thought to play a role in regulating gene expression (70). 

A recent study of terminator efficiency showed that only 3% of 
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E. coli terminators are "strong" (71). Inefficient termination 
would explain how convergent operons sometimes have overlap- 
ping transcription (19, 20). Therefore, we tested the hypothesis 
that partial termination between convergent operons would gen- 
erate complementary 3 ' transcript ends and add further complex- 
ity to the transcriptome. Figure 1 shows an intrinsic terminator 
located between convergent operons, which terminates transcrip- 
tion by 4-fold yet allows readthrough transcription of 329 bases of 
AS RNA for the 3 ' end of the convergent operon transcript. Our 
analyses of 370 instances of convergent operons revealed that 75% 
show transcription into an adjacent operon to generate comple- 
mentary 3 ' transcript ends that overlap by an average of 286 bases, 
the longest of which is 1,395 bases (see Table S5). In genome 
regions where there are many highly transcribed operons, it is 
more likely to observe convergent transcription. Of the genome 
regions corresponding to these convergent operons, 74% are 
highly conserved at the nucleotide sequence level in >50 E. coli 
(and Shigella) complete genomes. It is thus reasonable to suggest 
that overlapping transcription of convergent operons is a general 
property in bacteria. 

Transcription of divergent operons can result in overlapping 
transcripts (17, 18). Complementary transcripts generated by di- 
vergent promoters recently have been called "excludons," which 
are thought to act as negative regulators of genes on the opposite 
strand (34). Our analyses of 388 instances of divergent operons 
revealed that 35% have promoters arranged such that their 5' 
transcript ends overlap by an average of 168 bases, the longest of 
which is 1,012 bases (see Table S5). The genome regions corre- 
sponding to 81% of these overlapping divergent operons are 
highly conserved in >50 E. coli (and Shigella) genomes. The find- 
ing of sequence conservation says nothing about functionality, 
but our finding that over one-third of divergent operons generate 
overlapping complementary transcripts does suggest that ex- 
cludons may be prevalent in bacteria. 

Comparison to other data sets. We compared our AS tran- 
script annotations to other high-quality data sets using a conser- 
vative approach. We compared our data sets to highly expressed 
and experimentally verified AS transcripts from those studies. A 
contemporaneous single-nucleotide analysis of the E. coli tran- 
scriptome by Storz, Sharma, and colleagues (submitted for publi- 
cation) focused on AS transcripts. They found that most previ- 
ously annotated sRNAs are present at high levels, so we compared 
our AS RNA data set to the most highly expressed AS RNAs in 
their study. Our data corroborate 74 of their 127 highest- 
expressed AS RNAs. Furthermore, we corroborated 6 of 14 candi- 
date AS RNAs tested on Northern blots by the Storz group. How- 
ever, while their gels verified 6 of the 14, we corroborated only 2 of 
those 6, indicating that there is substantial variability in these 
high-throughput data sets. A recent coimmunoprecipitation 
study of the double -stranded E. coli transcriptome revealed 316 
double-stranded RNAs, including partially and fully overlapping 
transcripts as well as many generated by divergent and convergent 
operons (33). Our analyses predicted AS RNAs corresponding to 
13 of 21 double-stranded RNAs that were verified in Northern 
blot analysis (33). It is tempting to speculate that AS RNAs that are 
corroborated by RNA-Seq studies, are verified by Northern blot 
analysis, and correspond to highly conserved genome sequences 
are functional. However, functions have been confirmed for only 
a limited number of AS RNAs (56, 72). It remains to be seen how 
many of the AS RNAs identified by RNA-Seq will prove to be 



expressed in the same cell as the sense transcript and display a yet 
unknown phenotype. 

Bacterial operons compared to eukaryotic genes. It did not 

escape our attention that the widespread occurrence of bacterial 
operons with multiple TUs in some ways resembles alternative 
splicing of eukaryotic transcripts. From both bacterial operons 
and eukaryotic genes arise primary transcripts that are divided 
into alternative transcripts by the activity of transcriptional regu- 
latory features, i.e., internal promoters and terminators in bacteria 
and RNA splice junctions in eukaryotes. The potential for eukary- 
otic gene complexity is reflected in the number of exons per gene. 
The number of exons per gene in Saccharomyces cerevisiae is 1.1 

(73) , which is considerably fewer than the 1.7 TUs per operon in 
E. coli. In contrast, higher organisms have 4 to 9 introns per gene 

(74) , making them more complex than E. coli. Perhaps the loss of 
exons that is proposed to have happened in budding yeasts during 
evolution from more primitive eukaryotes accentuates their dif- 
ference from E. coli and higher organisms (75). We conclude that 
E. coli possesses operon complexity comparable to analogous gene 
structures in budding yeasts. 

Concluding statement. This study reveals the power of single- 
nucleotide resolved RNA-Seq data sets for pinpointing transcrip- 
tional features across the genome, which we used to annotate 
operons by precisely mapping their 5' and 3' ends. We found an 
astounding level of overlapping transcription where complemen- 
tary RNAs are transcribed from both strands, such as those gen- 
erated by several hundred convergent and divergent operons. We 
discovered more than 100 long AS transcripts overlapping oper- 
ons that also were transcribed on the sense strand. In sum, we 
found that approximately one in three (519 out of 1,510) operons 
at least partially overlaps with other operons to generate AS RNA. 
These AS transcripts are highly conserved in E. coli and appear to 
be noncoding RNA, suggesting that they are involved in regula- 
tion of gene expression, as has been proposed for excludons in 
bacteria (34) and IncRNAs in eukaryotes (70). We also found 7 
transcripts that did not correspond to an annotated gene and 
therefore represent previously unrecognized yet potentially func- 
tional operons. The transcriptome intricacy we observed in E. coli 
appears to be a general property of the domain bacteria, as the 
transcriptomes of several other bacteria appear to be similarly 
intricate (21,26,28,31,41, 47-49) . Whether the same is true of the 
Archaea must await high- resolution RNA-Seq analysis of repre- 
sentatives of this domain of life (83). Since operon arrangements 
are more highly conserved than gene repertoires (76), it is inter- 
esting to speculate that the requirements of primordial life led to 
the evolution of an operon architecture in bacteria which accom- 
modates substantial variation in gene expression. 

MATERIALS AND METHODS 

Bacterial strains and growth conditions. To annotate operons and char- 
acterize their response to carbon starvation, wild-type E. coli BW38028 
and£. co/z' BW39452 (ArpoS::cat) were grown in 1 liter of morpholinepro- 
panesulfonic acid (MOPS) minimal medium (77) containing 0.2% glu- 
cose in a fermenter at 37°C with constant pH and aeration. MOPS me- 
dium solutions were modified as described elsewhere (78), which permits 
preparation of 40 X "M" stock solution, giving the same final medium 
recipe (77). Cultures were sampled at 10 time points during growth of 
E. coli BW38028 and at five time points for E. coli BW39452, as shown in 
Fig. SI in the supplemental material. Logarithmic- and stationary-phase 
samples were duplicated from replicate cultures. 
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RNA sequencing. RNA was prepared by using an RNeasy kit (Qiagen, 
USA). Because small RNAs may be preferentially lost during column pu- 
rification, they are likely underrepresented in our data sets. Replicates of 
logarithmic- and stationary-phase RNA were treated with Terminator 
5 '-phosphate-dependent exonuclease (Epicenter, USA) to enrich 5'- 
triphosphate mRNA fragments for TSS mapping. RNA sequencing librar- 
ies (see Table SI) were prepared by using the strand-specific, ligation- 
based SOLiD Total RNA-Seq kit. Paired-end sequencing was performed 
on the SOLiD 4 Genetic Analyzer at Purdue University Genomics Facility. 

Raw data processing. Sequence reads were aligned to the E. coli 
MG1655 reference genome (U00096.2) with Bowtie version 1.8 (79). For 
the first pass, we used paired-end color space mapping with a distance 
cutoff of 350 bases between read mates. Bowtie parameters were set to 
include only perfect matches and retained only one alignment where a 
read mapped to more than one genome location. In practice, we found the 
efficiency of paired-end mapping was between 3 and 1 0%. To improve the 
overall alignment, we mapped the orphan 5'- and 3 '-end reads in two 
additional passes with Bowtie (one for the 5' reads and one for the 3' 
reads). The output of the three passes through Bowtie was three SAM files 
for each sample. Overall, we achieved 40 to 60% mapping efficiency with 
this three-pass strategy. SAMtools (80) utilities were used to convert SAM 
files to BAM format and to sort and index them. The binary read align- 
ment (BAM) files were displayed in Integrated Genome Viewer (IGV 
version 2) for primary analysis and quality control. The BAM files were 
then converted to base count (WIG) files. We accomplished this by using 
an in-house script to extract strand-specific base count data from BAM 
files (outputs are positive- and negative-strand WIG files). First, our 
solidbam2wig.pl script reads in the paired-end BAM file and counts the 
nucleotides spanning inserts between the mated 5' and 3' reads. Next, the 
script pulls in the orphan 5' and 3' reads from the respective BAM files 
and increments the base counts at each base location without duplicating 
the reads already incremented from the paired ends. Base count data were 
then normalized based on the assumption that reads are randomly dis- 
tributed across the genome and that if sequencing was sufficiently deep, all 
expressed transcripts would be represented in the data set (39). In prac- 
tice, SOLiD sequencing did not generate data sets in which the lowest- 
abundance transcripts were fully covered by contiguous reads. In addi- 
tion, inefficient ribo- depletion can bias the number of reads that map to 
non-rRNA genes. Our normalization strategy accounts for both of these 
factors by maximizing TU coverage and removing rRNA reads during 
data processing. Our in-house script, normWIG.pl, reads in the raw WIG 
files. A simple global normalization approach was utilized that multiplied 
the count at each base location by 1 billion and divides that value by the 
sum of base counts at all base locations in the file. This normalization 
strategy is analogous to the total count approach used for normalizing 
gene-specific read alignments (51). In this way, the base counts are ex- 
pressed as parts per billion. For display in JBrowse (61), the normalized 
WIG files were converted to BIGWIG files by using SAMtools (80). Anal- 
ysis of the data was conducted in a graphic user interface consisting of 
JBrowse (61) and an Oracle database. 

SELEX. Genomic SELEX was previously described (81). Antibodies 
against RpoD sigma, RpoS sigma, and core enzyme subunits were pro- 
duced in rabbits by injecting purified sigma proteins (82). 

Nucleotide sequence accession number. RNA sequencing data and 
curated results were deposited at Gene Expression Omnibus, accession 
no. GSE52059. 
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